Welcome back!!!

Recap of Part 1.

Here are the chapters:

  1. Introduction (Hello World!)
  2. Install R-studio – We successfully installed R-studio and familiarized ourselves with its interface.
  3. Yes, you are ready
  4. Data preparation + visualization
  5. Analysis of the observations
  6. Good luck :)

Part 2 might be a long one, but it will be fun!!

So let’s open R-studio as shown in the screenshot, and run all the lines that start with “library()”.

REMEMBER: say you turn off your Nintendo Switch, and the next day you want to play Super Mario. You do not need to buy the game again, but you do need to start the game again, right? It’s the same with RM’s game: if you want to jump and eat mushrooms, you need to run all the library() lines again every time you open R-studio to start the game!

# Load the tidyverse package (game)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the ggplot2 package (game)
library(ggplot2)
# Load the ggpmisc package (game)
library(ggpmisc)
## Warning: package 'ggpmisc' was built under R version 4.3.3
## Loading required package: ggpp
## Registered S3 methods overwritten by 'ggpp':
##   method                  from   
##   heightDetails.titleGrob ggplot2
##   widthDetails.titleGrob  ggplot2
## 
## Attaching package: 'ggpp'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## Registered S3 method overwritten by 'ggpmisc':
##   method                  from   
##   as.character.polynomial polynom
# Load the gganimate package (game)
library(gganimate)
# Load the animation package (game)
library(animation)
# Load the kableExtra package (game)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
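
If one of the library() lines fails with "there is no package called ...", that game has not been bought (installed) yet. Here is a minimal sketch for checking which of the packages above are missing. The names `needed`, `installed`, and `missing_pkgs` are just made up for this example.

```r
# Packages this tutorial loads with library()
needed <- c("tidyverse", "ggplot2", "ggpmisc", "gganimate", "animation", "kableExtra")

# requireNamespace() returns TRUE when a package is installed, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of the packages that are NOT installed
missing_pkgs <- needed[!installed]
missing_pkgs

# Uncomment to install whatever is missing (needs an internet connection)
# install.packages(missing_pkgs)
```

If `missing_pkgs` comes back empty, you are ready to play.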

3. Yes, you are ready

Let me remind you about the goal of our MLRS 101.

GOAL: learn how to make some results (plots / stats) with R-studio to answer "Any updates?"

Soooooo, before we jump into real genomics data, let’s start with some familiar data to make some figures and numbers.

Before we jump into any kind of data, do you know who this is?

He is Trae Young, an Atlanta Hawks NBA player who likes to shoot deep three-pointers.

If you don’t watch the NBA, too bad…. If you are interested in other sports or teams, you can apply parts of the code I wrote here to do some fun activities of your own! But for today, let’s start with a Hawks (RM) vs. Knicks (Maggie x Jason) NBA game to make some results!!
What we are going to do is check how the scores changed over time in the March 11th, 2020 game that the ATL Hawks (RM) played against the NY Knicks (Jason x Maggie) at State Farm Arena in Atlanta, Georgia.

NOW! We need NBA game data so that we can compare how the two teams’ scores changed, using R-studio.

4. Data preparation + visualization

1) Download the data

You can download the NBA data here: https://sports-statistics.com/database/basketball-data/nba/2019-20_pbp.csv

I saved this csv in the “Downloads” folder and did not change the name of the file (“2019-20_pbp.csv”).

Before you bring the data into R-studio, let’s open our CSV file and see what’s in there.

The first row shows the names of all the columns, and the data starts from the second row. When you scroll to the right, you can see that this data does not only show scores; it also includes other records, e.g. fouls, misses, and other plays.

2) Load the data to R-studio

Our file is a csv file, so we use the “read.csv()” function to bring it into R-studio. Different formats require different functions to bring data into R-studio, e.g. read.table(), readRDS(), read.delim(), etc.
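
To see why the format matters, here is a small sketch that writes the same made-up toy table twice, once separated by commas and once by tabs, and reads each one back with the matching function. The file names and the toy values are assumptions for illustration only; they are not part of the NBA data.

```r
# Write a tiny comma-separated file into a temporary folder
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("HomeTeam,HomeScore", "ATL,2", "ATL,4"), tmp_csv)

# Write the same table again, but separated by tabs
tmp_tsv <- tempfile(fileext = ".tsv")
writeLines(c("HomeTeam\tHomeScore", "ATL\t2", "ATL\t4"), tmp_tsv)

# read.csv() expects commas, read.delim() expects tabs
toy_csv <- read.csv(tmp_csv)
toy_tsv <- read.delim(tmp_tsv)

toy_csv
toy_tsv
```

Both objects should hold the same two rows. Try reading the tab file with read.csv() to see what goes wrong when the function does not match the format.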

Before you run the following code, check the Environment tab in the top right corner (it should say ‘Environment is empty’ before you run). If there is something in the environment, click the broom to clear it. A pop-up will ask you to confirm the removal, and you should say yes. As it tells you, you cannot get the objects in your environment back. But that’s ok, because you will have a saved script (code file) to generate those objects again if needed. So far the script has only loaded packages.

# Here we use the function "read.csv" to read/load the csv into Rstudio. 
# The path to the file is given as the argument to the function.
# "~" points to your home directory (which is not necessarily your working directory). Let's find out where that is.
path.expand('~')
## [1] "/Users/jasonyoo"
# Writing "~" first means you don't have to type the whole beginning of the path (on Windows, probably "C:/Users/you/Documents"); just start your directions to the file from where ~ ends.
NBA_19_20 <- read.csv("~/Downloads/2019-20_pbp.csv")

After you run NBA_19_20 <- read.csv("~/Downloads/2019-20_pbp.csv"), you will see a new object called “NBA_19_20” in the Environment tab.
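
If read.csv() complains that it "cannot open file", the path is usually the problem. A quick sketch for checking it, assuming you saved the file in the same place I did (`csv_path` is just a name I picked for this example):

```r
# Expand "~" into the full path so you can see exactly where R is looking
csv_path <- path.expand("~/Downloads/2019-20_pbp.csv")

# file.exists() returns TRUE if there is a file at that path
if (file.exists(csv_path)) {
  message("Found the file: ", csv_path)
} else {
  message("No file at ", csv_path, " - check the folder and the file name")
}
```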

The Environment tab is a very useful part of the R-studio interface: R works with objects (it borrows concepts from object-oriented programming, OOP), and this tab lists every object you have created. This is like choosing Mario when you play Super Mario: once you have chosen your Mario, you can use other functions to make this Mario jump, dodge, or eat mushrooms.

3) Check the data

We want to see what our data (Mario) looks like: how big it is, the color of its hat… etc.

## Let's see how big our Mario (data) is with a function called dim()
dim(NBA_19_20)
## [1] 539265     41

The result shows the number of rows and columns: there are 539,265 rows and 41 columns. These numbers should match the ones you see in the “Environment” tab.
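
dim() has a few cousins that are handy for the same kind of check. Here is a sketch on a tiny made-up Mario (`toy_mario` and its values are invented for illustration), so you can compare every number by eye:

```r
# A toy data frame with 3 rows and 4 columns
toy_mario <- data.frame(
  Quarter   = c(1, 1, 1),
  SecLeft   = c(720, 708, 707),
  HomeScore = c(0, 0, 0),
  AwayScore = c(0, 0, 2)
)

dim(toy_mario)   # rows first, then columns
nrow(toy_mario)  # only the number of rows
ncol(toy_mario)  # only the number of columns
str(toy_mario)   # column names, types, and the first few values
```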

Let’s check the top six rows of our data (Mario) with the head() function. If you want to see the last six rows of our data, you can try tail(NBA_19_20).

## head() function to check the top rows
# The default is six rows, but if you want to see the top 20 rows you can write head(NBA_19_20, n=20)
head(NBA_19_20) 
##                            URL GameType                        Location
## 1 /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
## 2 /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
## 3 /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
## 4 /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
## 5 /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
## 6 /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
##              Date    Time WinningTeam Quarter SecLeft AwayTeam
## 1 October 22 2019 8:00 PM         TOR       1     720      NOP
## 2 October 22 2019 8:00 PM         TOR       1     708      NOP
## 3 October 22 2019 8:00 PM         TOR       1     707      NOP
## 4 October 22 2019 8:00 PM         TOR       1     707      NOP
## 5 October 22 2019 8:00 PM         TOR       1     689      NOP
## 6 October 22 2019 8:00 PM         TOR       1     685      NOP
##                                                       AwayPlay AwayScore
## 1 Jump ball: D. Favors vs. M. Gasol (L. Ball gains possession)         0
## 2                     L. Ball misses 2-pt jump shot from 11 ft         0
## 3                               Offensive rebound by D. Favors         0
## 4                            D. Favors makes 2-pt layup at rim         2
## 5                                                                      2
## 6                               Defensive rebound by J. Redick         2
##   HomeTeam                               HomePlay HomeScore
## 1      TOR                                                0
## 2      TOR                                                0
## 3      TOR                                                0
## 4      TOR                                                0
## 5      TOR O. Anunoby misses 2-pt layup from 3 ft         0
## 6      TOR                                                0
##                  Shooter       ShotType ShotOutcome ShotDist Assister Blocker
## 1                                                         NA                 
## 2     L. Ball - balllo01 2-pt jump shot        miss       11                 
## 3                                                         NA                 
## 4  D. Favors - favorde01     2-pt layup        make        0                 
## 5 O. Anunoby - anunoog01     2-pt layup        miss        3                 
## 6                                                         NA                 
##   FoulType Fouler Fouled             Rebounder ReboundType ViolationPlayer
## 1                                                                         
## 2                                                                         
## 3                        D. Favors - favorde01   offensive                
## 4                                                                         
## 5                                                                         
## 6                        J. Redick - redicjj01   defensive                
##   ViolationType TimeoutTeam FreeThrowShooter FreeThrowOutcome FreeThrowNum
## 1                                                                         
## 2                                                                         
## 3                                                                         
## 4                                                                         
## 5                                                                         
## 6                                                                         
##   EnterGame LeaveGame TurnoverPlayer TurnoverType TurnoverCause TurnoverCauser
## 1                                                                             
## 2                                                                             
## 3                                                                             
## 4                                                                             
## 5                                                                             
## 6                                                                             
##      JumpballAwayPlayer   JumpballHomePlayer       JumpballPoss  X
## 1 D. Favors - favorde01 M. Gasol - gasolma01 L. Ball - balllo01 NA
## 2                                                               NA
## 3                                                               NA
## 4                                                               NA
## 5                                                               NA
## 6                                                               NA

Here is a more organized version of the data that you can compare with the data in your Environment.

# Here we specify to look for the function "paged_table" in the package "rmarkdown"
# If you do not have the entire package loaded, you can access just this function
# For example, maybe we want to use Mario's car, but not install all of mariokart, so we might use mariokart::car

rmarkdown::paged_table(NBA_19_20)

As you can see, this is way too much information for us, so let’s look at the column names to see what kinds of information in this data suit our goal.

colnames(NBA_19_20)
##  [1] "URL"                "GameType"           "Location"          
##  [4] "Date"               "Time"               "WinningTeam"       
##  [7] "Quarter"            "SecLeft"            "AwayTeam"          
## [10] "AwayPlay"           "AwayScore"          "HomeTeam"          
## [13] "HomePlay"           "HomeScore"          "Shooter"           
## [16] "ShotType"           "ShotOutcome"        "ShotDist"          
## [19] "Assister"           "Blocker"            "FoulType"          
## [22] "Fouler"             "Fouled"             "Rebounder"         
## [25] "ReboundType"        "ViolationPlayer"    "ViolationType"     
## [28] "TimeoutTeam"        "FreeThrowShooter"   "FreeThrowOutcome"  
## [31] "FreeThrowNum"       "EnterGame"          "LeaveGame"         
## [34] "TurnoverPlayer"     "TurnoverType"       "TurnoverCause"     
## [37] "TurnoverCauser"     "JumpballAwayPlayer" "JumpballHomePlayer"
## [40] "JumpballPoss"       "X"

Based on this information, we only need a few columns to check how the scores changed over time. Here is the information we need to make the plot:

  1. Location: State Farm Arena Atlanta Georgia

  2. Date: March 11 2020

  3. Quarter: which quarter they are playing

  4. SecLeft: how many seconds left per quarter

  5. AwayTeam: NYK

  6. AwayScore: we will plot this as dependent variable

  7. HomeTeam: ATL

  8. HomeScore: we will plot this as dependent variable

  9. ShotType: for the summary outcome

  10. ShotOutcome: for the summary outcome

  11. HomePlay: for the summary outcome

  12. AwayPlay: for the summary outcome

Let’s learn how to shrink our massive data by selecting only the columns we want!

** You should always select columns based on your purpose!

4) Select the columns

In this case, we are going to select twelve columns: Location, Date, Quarter, SecLeft, AwayTeam, AwayScore, HomeTeam, HomeScore, ShotType, ShotOutcome, HomePlay, and AwayPlay.

library(dplyr)
# Let's select 12 columns with select function. 
NBA_19_20_with_12cols <- NBA_19_20 %>% select(Location, Date,
                                        Quarter, SecLeft, 
                                        AwayTeam, AwayScore,
                                        HomeTeam, HomeScore,
                                        ShotType, ShotOutcome,
                                        HomePlay, AwayPlay
                                        )

# Here we have used the pipe "%>%" which was loaded thanks to tidyverse. This weird set of symbols says "use the thing before %>% as the first argument for the following function"
# Whenever you use the pipe, there is another valid way you could have written the function. Let's write it without the pipe as well
NBA_19_20_with_12cols <- select(NBA_19_20, Location, Date,
                                        Quarter, SecLeft, 
                                        AwayTeam, AwayScore,
                                        HomeTeam, HomeScore,
                                        ShotType, ShotOutcome,
                                        HomePlay, AwayPlay)

# So why would you use the pipe or not? I think the first way is easier to read. We know that before the pipe is the dataset we are selecting from, and everything inside the parentheses is the columns we are selecting.
# Readability becomes even more obvious when you use the pipe to chain functions together. For example, we could read in the csv and select all at once with the pipe
NBA_19_20_with_12cols <- read.csv("~/Downloads/2019-20_pbp.csv") %>% 
                        select(Location, Date, Quarter, 
                               SecLeft, AwayTeam, AwayScore, 
                               HomeTeam, HomeScore,
                               ShotType, ShotOutcome,
                               HomePlay, AwayPlay)

# Which can also be written without the pipe
NBA_19_20_with_12cols <- select(read.csv("~/Downloads/2019-20_pbp.csv"), Location, Date,
                               Quarter, SecLeft, AwayTeam,
                               AwayScore, HomeTeam, HomeScore,
                               ShotType, ShotOutcome,
                               HomePlay, AwayPlay)

# So should you chain all your commands together forever? No, probably not.
# For example in this case, by selecting when we read in the csv we never get the chance to use dim, head, colnames, etc. to view the whole dataset
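
If you want to convince yourself that the piped and un-piped versions really give the same result, here is a small sketch on made-up data. The object `toy` and its values are invented for this example, and it assumes dplyr is installed.

```r
library(dplyr)

# A tiny toy table with one column we do not need
toy <- data.frame(
  Location  = c("A", "A", "B"),
  HomeScore = c(0, 2, 4),
  Extra     = c("x", "y", "z")
)

# With the pipe: read left to right (take toy, select, then head)
with_pipe <- toy %>% select(Location, HomeScore) %>% head(2)

# Without the pipe: read inside out (head of the selection of toy)
without_pipe <- head(select(toy, Location, HomeScore), 2)

# identical() checks whether the two objects are exactly the same
identical(with_pipe, without_pipe)
```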

Pause here.

Now you can see two objects in the “Environment” tab at the top right: one is “NBA_19_20” and the other is “NBA_19_20_with_12cols”. The “NBA_19_20” object has 539,265 rows and 41 columns, but “NBA_19_20_with_12cols” has 539,265 rows and only the 12 columns that we selected with the “select()” function. This is one of the ways to check how our object changed before and after our code.

Compare your data with this table.

rmarkdown::paged_table(NBA_19_20_with_12cols)

We could use the object with 539,265 rows and 12 columns, but we still have some room to optimize our process. We can narrow it down to the games that happened in Atlanta only, so we need to filter for the games that happened at State Farm Arena. (In this case we only have 539,265 rows, so it’s not a big deal. But what if you have data with 3 million rows? That would significantly slow down data processing, so it’s better to learn how to grab just the rows we need / want.)

5) Filter the rows

5-1) Games from State Farm Arena

In this case, we are going to use a function called “filter()” to keep only the games that happened at “State Farm Arena Atlanta Georgia”. The filter() function requires two components: one is the object that we want to filter, and the other is the condition. So in this case we want to filter the “NBA_19_20_with_12cols” object, keeping the rows where the “Location” column is equal to “State Farm Arena Atlanta Georgia”.

NBA_19_20_only_SFA <- filter(NBA_19_20_with_12cols, Location == "State Farm Arena Atlanta Georgia")

# How would you write this using the pipe?
NBA_19_20_only_SFA <- NBA_19_20_with_12cols %>% filter(Location == "State Farm Arena Atlanta Georgia")
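
filter() can also take more than one condition at a time, separated by commas, and a row is kept only when every condition is true. A minimal sketch on made-up data (`toy_games`, `one_game`, and the scores are invented for illustration, and it assumes dplyr is installed):

```r
library(dplyr)

# Three toy rows: two from Atlanta on different dates, one from Toronto
toy_games <- data.frame(
  Location  = c("State Farm Arena Atlanta Georgia",
                "State Farm Arena Atlanta Georgia",
                "Scotiabank Arena Toronto Canada"),
  Date      = c("March 9 2020", "March 11 2020", "October 22 2019"),
  HomeScore = c(1, 2, 3)
)

# Keep rows where BOTH the location AND the date match
one_game <- toy_games %>%
  filter(Location == "State Farm Arena Atlanta Georgia",
         Date == "March 11 2020")

one_game
```

In this tutorial we filter one step at a time instead, so that we can check the object after every step.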

Welcome to State Farm Arena

Let’s check our data with the head() function. In addition, when you check your “Environment” tab, you can see that your “NBA_19_20_only_SFA” has 16,929 rows and 12 columns.

head(NBA_19_20_only_SFA) 
##                           Location            Date Quarter SecLeft AwayTeam
## 1 State Farm Arena Atlanta Georgia October 26 2019       1     720      ORL
## 2 State Farm Arena Atlanta Georgia October 26 2019       1     707      ORL
## 3 State Farm Arena Atlanta Georgia October 26 2019       1     688      ORL
## 4 State Farm Arena Atlanta Georgia October 26 2019       1     683      ORL
## 5 State Farm Arena Atlanta Georgia October 26 2019       1     683      ORL
## 6 State Farm Arena Atlanta Georgia October 26 2019       1     670      ORL
##   AwayScore HomeTeam HomeScore       ShotType ShotOutcome
## 1         0      ATL         0                           
## 2         0      ATL         2 2-pt jump shot        make
## 3         0      ATL         2     2-pt layup        miss
## 4         0      ATL         2                           
## 5         2      ATL         2     2-pt layup        make
## 6         2      ATL         4 2-pt jump shot        make
##                                   HomePlay
## 1                                         
## 2 T. Young makes 2-pt jump shot from 14 ft
## 3                                         
## 4                                         
## 5                                         
## 6 T. Young makes 2-pt jump shot from 18 ft
##                                                     AwayPlay
## 1 Jump ball: N. Vuevi vs. A. Len (T. Young gains possession)
## 2                                                           
## 3 A. Gordon misses 2-pt layup from 2 ft (block by D. Hunter)
## 4                             Offensive rebound by A. Gordon
## 5                       A. Gordon makes 2-pt layup from 1 ft
## 6
## Again, head() shows you the first six rows of the "NBA_19_20_only_SFA" data.

Let’s learn two more functions that are quite useful for checking that we kept only the NBA games from State Farm Arena: print() and unique(). The print() function shows its output in the Console tab. The unique() function returns the unique values from our input. So the following code prints the unique values from the “Location” column of the “NBA_19_20_only_SFA” data.

print(unique(NBA_19_20_only_SFA$Location))
## [1] "State Farm Arena Atlanta Georgia"
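
unique() tells you which values exist, but not how many rows each one has. If you also want the counts, base R's table() is one option. A sketch on a made-up column (`toy_dates` is invented for this example):

```r
# Five toy rows spread over two dates
toy_dates <- data.frame(
  Date = c("March 9 2020", "March 9 2020",
           "March 11 2020", "March 11 2020", "March 11 2020")
)

# Which dates are in the data?
unique(toy_dates$Date)

# How many different dates are there?
length(unique(toy_dates$Date))

# How many rows does each date have?
table(toy_dates$Date)
```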

5-2) Games on March 11th, 2020

Second, let’s filter for the game that happened on “March 11 2020” and check our data with the head() and unique() functions.

print(unique(NBA_19_20_only_SFA$Date))
##  [1] "October 26 2019"  "October 28 2019"  "October 31 2019"  "November 5 2019" 
##  [5] "November 6 2019"  "November 8 2019"  "November 20 2019" "November 23 2019"
##  [9] "November 25 2019" "December 2 2019"  "December 4 2019"  "December 13 2019"
## [13] "December 15 2019" "December 19 2019" "December 27 2019" "January 4 2020"  
## [17] "January 6 2020"   "January 8 2020"   "January 14 2020"  "January 18 2020" 
## [21] "January 20 2020"  "January 22 2020"  "January 26 2020"  "January 30 2020" 
## [25] "February 3 2020"  "February 9 2020"  "February 20 2020" "February 22 2020"
## [29] "February 26 2020" "February 28 2020" "February 29 2020" "March 2 2020"    
## [33] "March 9 2020"     "March 11 2020"
NBA_19_20_SAF_March_11 <- filter(NBA_19_20_only_SFA, Date == "March 11 2020") 
head(NBA_19_20_SAF_March_11)
##                           Location          Date Quarter SecLeft AwayTeam
## 1 State Farm Arena Atlanta Georgia March 11 2020       1     716      NYK
## 2 State Farm Arena Atlanta Georgia March 11 2020       1     708      NYK
## 3 State Farm Arena Atlanta Georgia March 11 2020       1     705      NYK
## 4 State Farm Arena Atlanta Georgia March 11 2020       1     699      NYK
## 5 State Farm Arena Atlanta Georgia March 11 2020       1     696      NYK
## 6 State Farm Arena Atlanta Georgia March 11 2020       1     690      NYK
##   AwayScore HomeTeam HomeScore       ShotType ShotOutcome
## 1         0      ATL         0                           
## 2         0      ATL         0     2-pt layup        miss
## 3         0      ATL         0                           
## 4         0      ATL         0 3-pt jump shot        miss
## 5         0      ATL         0                           
## 6         0      ATL         0 2-pt jump shot        miss
##                                     HomePlay
## 1                                           
## 2                                           
## 3             Defensive rebound by D. Dedmon
## 4 D. Dedmon misses 3-pt jump shot from 25 ft
## 5                  Offensive rebound by Team
## 6  T. Young misses 2-pt jump shot from 16 ft
##                                  AwayPlay
## 1                                        
## 2 M. Harkless misses 2-pt layup from 1 ft
## 3                                        
## 4                                        
## 5                                        
## 6
print(unique(NBA_19_20_SAF_March_11$Date))
## [1] "March 11 2020"

As you can see from your “Environment” tab, your data “NBA_19_20_SAF_March_11” now has 533 rows and 12 columns.

6) Add new column

Our goal is to make a figure that shows the scores changing over time. So we need two kinds of data: one is the home and away teams’ scores, and the other is time. We are going to use SecLeft as our independent variable (x-axis), and the AwayScore and HomeScore columns as our dependent variables (y-axis). This is what our final figure will look like!

6-1) Visualize what we have

First, let’s see how the ATL Hawks’ score changes over time with the HomeScore and SecLeft columns. We are going to use the functions ggplot() and geom_line(). Using ggplot is like playing with Lego to build your own castle. You can start with simple blocks, and keep adding more to change the color or add more dots to your plot.

We start each ggplot figure with ggplot(). This is the foundational block that we will add other blocks onto. First we need to tell ggplot where to look for the data, then we can tell it how to display the data. Anything we specify in this foundational block will be used in later blocks as well.

library(ggplot2)
# We tell ggplot that our data is "NBA_19_20_SAF_March_11" and that "SecLeft" should be displayed on the x axis.
HAWKS_plot <- ggplot(NBA_19_20_SAF_March_11, aes(x = SecLeft)) +
                              geom_line(aes(y = HomeScore)) # Then we ADD geom_line to our foundational block. geom_line allows us to make... line graphs! We have already specified that SecLeft will be the x axis, so now we tell ggplot that HomeScore will be the y axis.

# Let's display the plot
HAWKS_plot

6-2) Understand what we have

As you can see from the “HAWKS_plot” result, this is not what we wanted to see. There are some patterns, but the score does not rise over time. Let’s check the data we plotted here: we are going to look at SecLeft and HomeScore from our NBA_19_20_SAF_March_11 data. If you know basketball, you know the quarter is also related to time, so let’s get that column too.

NBA_19_20_SAF_March_11_time_n_score <- NBA_19_20_SAF_March_11 %>% select(Quarter, SecLeft, HomeScore)
NBA_19_20_SAF_March_11_time_n_score[order(NBA_19_20_SAF_March_11_time_n_score$SecLeft),] %>% head()
##     Quarter SecLeft HomeScore
## 125       1       0        24
## 233       2       0        50
## 234       2       0        50
## 235       2       0        50
## 350       3       0        78
## 472       4       0       118

Check your console: you should see something like the following screenshot. Each record is flagged with the Quarter being played, and SecLeft counts down from the total seconds in that quarter. This is why the line in “HAWKS_plot” zigzags: SecLeft runs from about 700 seconds down to 0 in every quarter. This means that if we want to see the score increase over time, we need to create one more column that increases over time and takes into account which quarter the game is in, not just how many seconds are left in that quarter.

6-3) Think about how

We want to plot the score over time; however, the original data is split by quarter and by the seconds left in each quarter. So we are going to create a new column called “PlayTime” that starts at 0 for the first record and keeps increasing through the game.

We have to figure out how to turn SecLeft into cumulative seconds (PlayTime). – This is just one way to solve this problem, and you will have to figure out a solution that works for your own data in the future!

Also, this step might be the most time-consuming part for you in the future, but REMEMBER there is no “THE ANSWER”.

What we are going to do is find the total play time up to the end of each quarter and subtract the SecLeft column from it.

For example, the first record in our data has Quarter == 1 and SecLeft == 716. In this case, we find the max SecLeft in Q1, and then:

PlayTime for Q1 = max(SecLeft in Q1) - SecLeft

However, for plays in the second quarter, we have to add the first quarter on top:

PlayTime for Q2 = max(SecLeft in Q1) + max(SecLeft in Q2) - SecLeft

Let’s compute the max SecLeft per quarter with the max() and filter() functions and store the running totals as objects.

Q1_time <- max(filter(NBA_19_20_SAF_March_11, Quarter == 1)$SecLeft)
Q2_time <- Q1_time + max(filter(NBA_19_20_SAF_March_11, Quarter == 2)$SecLeft)
Q3_time <- Q2_time + max(filter(NBA_19_20_SAF_March_11, Quarter == 3)$SecLeft)
Q4_time <- Q3_time + max(filter(NBA_19_20_SAF_March_11, Quarter == 4)$SecLeft)

They played overtime, so we have to compute Q5_time too!

Q5_time <- Q4_time + max(filter(NBA_19_20_SAF_March_11, Quarter == 5)$SecLeft)
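
Writing one line per quarter works well for a single game. As a sketch of another way to get the same running totals in one go, base R's tapply() and cumsum() can be combined. The toy data below is made up (three quarters only), and `quarter_max` and `quarter_end` are names I picked for this example.

```r
# Toy play-by-play: the first record of Q1 starts at 716 seconds, like our game
toy <- data.frame(
  Quarter = c(1, 1, 2, 2, 3),
  SecLeft = c(716, 0, 720, 0, 720)
)

# Largest SecLeft within each quarter
quarter_max <- tapply(toy$SecLeft, toy$Quarter, max)

# Running total: seconds played by the END of each quarter
quarter_end <- cumsum(quarter_max)
quarter_end
```

Here `quarter_end` plays the role of Q1_time, Q2_time, and so on, and it does not need an extra line when a game goes to overtime.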

6-4) Add new column with conditions.

NBA_19_20_SAF_March_11$PlayTime <- with(NBA_19_20_SAF_March_11,
                                        ## Subtract SecLeft from the cumulative seconds at the end of each quarter (Quarter 5 is overtime)
                                        ## Inside with() we can write SecLeft directly instead of NBA_19_20_SAF_March_11$SecLeft
                                        ifelse(Quarter == 5, Q5_time - SecLeft,
                                          ifelse(Quarter == 4, Q4_time - SecLeft,
                                            ifelse(Quarter == 3, Q3_time - SecLeft,
                                              ifelse(Quarter == 2, Q2_time - SecLeft,
                                                     Q1_time - SecLeft))))
                                        )
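
Nested ifelse() calls get hard to read once there are many conditions. As a hedged alternative, dplyr's mutate() with case_when() expresses the same idea as a list of "condition ~ value" pairs. This sketch uses made-up data with only two quarters; `toy`, `q1_end`, and `q2_end` are invented names, and it assumes dplyr is installed.

```r
library(dplyr)

# Toy play-by-play with two quarters
toy <- tibble(
  Quarter = c(1, 1, 2, 2),
  SecLeft = c(716, 0, 720, 0)
)

# Cumulative seconds at the end of each quarter (same idea as Q1_time, Q2_time)
q1_end <- 716
q2_end <- q1_end + 720

# case_when() checks the conditions from top to bottom; TRUE is the fallback
toy <- toy %>%
  mutate(PlayTime = case_when(
    Quarter == 2 ~ q2_end - SecLeft,
    TRUE         ~ q1_end - SecLeft
  ))

toy
```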

Some of the rows are repeated (for example, records related to foul calls), so we have to keep only one copy of each row with the pipe (%>%) and distinct().

## 533 rows now became 416 rows. 
NBA_19_20_SAF_March_11 <- NBA_19_20_SAF_March_11 %>% distinct()
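
To see what distinct() does on a small scale, here is a sketch with made-up rows where the first two are exact copies of each other (`toy_plays` and `toy_unique` are invented for this example, and it assumes dplyr is installed):

```r
library(dplyr)

# Rows 1 and 2 are identical in every column
toy_plays <- data.frame(
  PlayTime  = c(10, 10, 25),
  HomeScore = c(2, 2, 4),
  AwayScore = c(0, 0, 0)
)

nrow(toy_plays)

# distinct() keeps one copy of each fully identical row
toy_unique <- toy_plays %>% distinct()

nrow(toy_unique)
```

Note that a row is dropped only when every column matches another row, so two different plays at the same PlayTime are both kept.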

Let’s run the same plotting code we used in section 6-1), this time with our new time column.

HAWKS_plot <- ggplot(NBA_19_20_SAF_March_11, aes(PlayTime)) +   
                              geom_line(aes(y = HomeScore))
HAWKS_plot

Looks great to me :) now HAWKS score is increasing over time !!!

Let’s add more things to our new plot like Lego. First, let’s add AwayScore to our plot.

HAWKS_KNICKS_plot <- ggplot(NBA_19_20_SAF_March_11, aes(x = PlayTime)) +   
                              geom_line(aes(y = HomeScore)) +
                              geom_line(aes(y = AwayScore)) # We add another line to ggplot. Remember x is already PlayTime, but for this line, AwayScore will be the y value rather than HomeScore

HAWKS_KNICKS_plot
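
Adding one geom_line() per team is just one way to do it. Another common approach, sketched here under my own assumptions, is to reshape the data into a "long" format with tidyr's pivot_longer() so that a single geom_line() draws both teams and ggplot picks the colors. The toy values and the names `toy`, `toy_long`, `Team`, and `Score` are made up for illustration.

```r
library(tidyr)
library(ggplot2)

# Toy scores for three moments of a game
toy <- data.frame(
  PlayTime  = c(0, 60, 120),
  HomeScore = c(0, 2, 5),
  AwayScore = c(0, 3, 3)
)

# Stack HomeScore and AwayScore into one Score column, labelled by Team
toy_long <- pivot_longer(toy,
                         cols = c(HomeScore, AwayScore),
                         names_to = "Team",
                         values_to = "Score")

# One geom_line() now draws one line per Team, with a legend for free
toy_plot <- ggplot(toy_long, aes(x = PlayTime, y = Score, color = Team)) +
  geom_line()

toy_plot
```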

7) Detail is Everything

Our “HAWKS_KNICKS_plot” looks promising, but we can enhance it. If a plot contains valuable information but is only understandable to you, it needs improvement. Let’s enhance readability by adding colors to our lines, increasing their thickness, and incorporating vertical lines to mark the start of each quarter.

NBA_19_20_SAF_March_11$PlayTime <- as.numeric(NBA_19_20_SAF_March_11$PlayTime)

# Note: since ggplot2 3.4.0, line thickness is set with `linewidth` (using `size` for lines is deprecated and gives a warning)
HAWKS_vs_KNICKS_plot <- ggplot(NBA_19_20_SAF_March_11, aes(PlayTime)) +   
                              geom_line(aes(y = AwayScore), color = "orange", linewidth = 2.5) + 
                              geom_line(aes(y = HomeScore), color = "red", linewidth = 2.5) +
                              ## Add quarter lines
                              geom_vline(xintercept = Q1_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q2_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q3_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q4_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q5_time, colour="black", linetype = "longdash")
HAWKS_vs_KNICKS_plot
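
One small shortcut: geom_vline() accepts a vector for xintercept, so all the quarter lines can be drawn with one Lego block instead of five. A minimal sketch on made-up data (`toy_scores` and `toy_quarter_ends` are invented for this example):

```r
library(ggplot2)

# Toy scores over time
toy_scores <- data.frame(
  PlayTime  = c(0, 700, 1400, 2100),
  HomeScore = c(0, 24, 50, 78)
)

# All the quarter boundaries stored in ONE vector
toy_quarter_ends <- c(716, 1436)

toy_plot <- ggplot(toy_scores, aes(x = PlayTime, y = HomeScore)) +
  geom_line(color = "red", linewidth = 2.5) +
  # One geom_vline() call draws a dashed line at every value in the vector
  geom_vline(xintercept = toy_quarter_ends, colour = "black", linetype = "longdash")

toy_plot
```

With the real data, the vector would be c(Q1_time, Q2_time, Q3_time, Q4_time, Q5_time).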

8) Final product with animation

## make a score board for the game
score_df <- tibble(NYK = max(NBA_19_20_SAF_March_11$AwayScore),
                    ATL = max(NBA_19_20_SAF_March_11$HomeScore))

HAWKS_vs_KNICKS_final_plot <- ggplot(NBA_19_20_SAF_March_11, aes(PlayTime)) +   
                              geom_line(aes(y = AwayScore, group = AwayTeam),  color = "orange", linewidth = 2.5) + 
                              geom_line(aes(y = HomeScore, group = HomeTeam), color = "red", linewidth = 2.5) + 
                              ## Add quarter lines
                              geom_vline(xintercept = Q1_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q2_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q3_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q4_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q5_time, colour="black", linetype = "longdash") +
                              labs(y= "Scores", x = "Time (seconds)") +
                              # Add table to the figure
                              annotate(geom = "table", x = 20, y = 140,
                                      label = list(score_df),
                                       vjust = 1, hjust = 0, size = 9,
                                       table.theme = ttheme_gtlight
                                       )


HAWKS_vs_KNICKS_final_plot_animated <- HAWKS_vs_KNICKS_final_plot +
                                  transition_reveal(PlayTime) +
                                  view_follow(fixed_y = TRUE) 

Based on the figure below, the New York Knicks outplayed the Hawks for most of the game, but the Hawks caught up at the end of Q4. The Hawks went to overtime to keep their hopes alive, but they lost in their home arena ;///. – Sorry Richard.

HAWKS_vs_KNICKS_final_plot_animated

HAWKS_vs_KNICKS_final_plot
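
When someone asks "Any updates?", you will probably want the figure as a file you can send. Here is a sketch of saving a static plot with ggplot2's ggsave(). The toy plot, the file name, and the width/height/dpi values are my own assumptions; it writes into a temporary folder so nothing in your Downloads gets touched.

```r
library(ggplot2)

# A small toy plot to save
toy_plot <- ggplot(data.frame(PlayTime = c(0, 60, 120), HomeScore = c(0, 2, 5)),
                   aes(x = PlayTime, y = HomeScore)) +
  geom_line()

# Build a file path inside R's temporary folder
out_file <- file.path(tempdir(), "toy_score_plot.png")

# ggsave() picks the file type from the extension (.png here)
ggsave(out_file, plot = toy_plot, width = 8, height = 5, dpi = 150)

file.exists(out_file)
```

To save the real figure, you would pass HAWKS_vs_KNICKS_final_plot as the plot and a path such as "~/Downloads/..." of your choice.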

9) Let’s save our game

Imagine reaching level 10 of Mario, only to lose all progress! To prevent this, you’d save your game. Similarly, let’s save our DataFrame, “NBA_19_20_SAF_March_11,” to avoid starting over. We’ll save it in your “Downloads” folder as “NBA_19_20_SAF_March_11.rds.” The “.rds” extension is a file format, similar to how a Word document uses “.docx.”

# saveRDS(NBA_19_20_SAF_March_11, "~/Downloads/NBA_19_20_SAF_March_11.rds", compress = TRUE)
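
Saving is only half of the story; the next day you also want to load your game. This sketch shows the full round trip with saveRDS() and readRDS() on a made-up data frame and a temporary file (`toy_save`, `toy_loaded`, and `rds_path` are invented for this example):

```r
# A toy object standing in for NBA_19_20_SAF_March_11
toy_save <- data.frame(PlayTime = c(0, 60, 120), HomeScore = c(0, 2, 5))

# Save it to a temporary .rds file
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_save, rds_path, compress = TRUE)

# Load it back into a new object
toy_loaded <- readRDS(rds_path)

# The loaded object should be exactly what we saved
identical(toy_save, toy_loaded)
```

Unlike a csv, an .rds file keeps the column types exactly as they were, so you do not need to clean the data again after loading.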

Once saved, you can close RStudio as we learned in the previous section.

End of Part 2!!

I’m so proud that you’ve made it this far!

Quick review of what we’ve done in Parts 1 and 2.

  1. Introduction (Hello World!)
  2. Install R-studio

– We successfully installed R-studio and familiarized ourselves with its interface.

  3. Yes, you are ready
  4. Data preparation + visualization

– We successfully learned how to visualize data as desired (including data cleaning and manipulation).

For Part 3, we’ll dive into why the Hawks lost.

  5. Analysis of the observations
  6. Good luck :)

Here are some key points for you to consider - these will be great review questions:

  1. How can we analyze the 76ers’ games?
  2. How can we improve our final plot?
  3. Are there other ways to format the data besides the method we used?